Take-home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods

Author

Ho Zi Jun

Published

October 16, 2024

Modified

November 17, 2024

1 Overview

For take-home exercise 3, it consists of 2 options: Take-home Exercise 3a: Modelling Geography of Financial Inclusion with Geographically Weighted Methods]{.underline} & Take-home Exercise 3b: Predicting HDB Resale Prices with Geographically Weighted Machine Learning Methods. The selected option for this take-home exercise will be 3b.

Housing plays a crucial role in household wealth across the globe, with purchasing a home representing a significant investment for most individuals. Housing prices are influenced by a variety of factors. Some of these factors are global, such as the overall economic conditions of a country or the inflation rate, while others are specific to individual properties. These factors can be categorised into structural and locational components.

Structural factors relate directly to the characteristics of the property, such as its size, amenities, and tenure.

Locational factors pertain to the surrounding environment, including proximity to childcare centres, public transportation, and shopping facilities.

Traditionally, predictive models for housing resale prices have been developed using the Ordinary Least Squares (OLS) method. However, this approach does not account for spatial autocorrelation and spatial heterogeneity present in geographic datasets, such as those related to housing transactions. When spatial autocorrelation is present, OLS estimates can produce biased, inconsistent, or inefficient results (Anselin 1998). To address this limitation, Geographically Weighted Models (GWMs) have been introduced, offering a more accurate approach to modelling and predicting housing resale prices. The steps involved will ultimately build a hedonic pricing models using this methods.

1.1 Task and Outcomes

For this take-home exercise, the primary dataset should be the HDB Resale Flat Prices available on data.gov.sg The analysis should concentrate on one specific flat type: three-room, four-room, or five-room flats.

The following is a list of suggested predictors to consider, though students are encouraged to include any other relevant independent variables that may enhance the analysis.

  • Structural factors
    • Area of the unit
    • Floor level
    • Remaining lease
    • Age of the unit
    • Main Upgrading Program (MUP) completed (optional)
  • Locational factors
    • Proximity to CBD
    • Proximity to eldercare
    • Proximity to foodcourt/hawker centres
    • Proximity to MRT
    • Proximity to park
    • Proximity to good primary school
    • Proximity to shopping mall
    • Proximity to supermarket
    • Numbers of kindergartens within 350m
    • Numbers of childcare centres within 350m
    • Numbers of bus stop within 350m
    • Numbers of primary school within 1km

The four-room flats will be the chosen flat type for analysis as it is one of the most common HDB BTO flat types, which offers a comfortable living space for young couples and families.

Additionally, in this take-home exercise, we are tasked with calibrating a predictive model to forecast HDB resale prices for the period of July to September 2024, using resale transaction data from 2023 as the basis for analysis.

2 Installing and Loading R Packages

The code chunk below will ensure for a list of required R packages to be created, checked for installation, and installed if missing. Once installed, all packages will be loaded for use in the exercise.

pacman::p_load(tidyverse, sf, spdep, GWmodel, SpatialML, spatstat, units, matrixStats, ggpubr,
               tmap, rsample, Metrics, httr, jsonlite, rvest, olsrr, corrplot, ggstatsplot)

3 HDB resale Data (Aspatial)

The following sections will consist of steps which import, process and wrangling of data.

Data used for this exercise is HDB Resale data: a list of HDB resale transacted prices in Singapore from Jan 2017 onwards. It is in csv format which can be downloaded from data.gov.sg.

When first downloaded, the data was labelled as ResaleflatpricesbasedonregistrationdatefromJan2017onwards. Hence, it was subsequently renamed to resale for ease of referencing and to avoid unnecessary mistakes. Similar to what is required in the task of using HDB resale transaction records in 2023 to predict HDB resale prices between July-September 2024 the code chunk below filters for transaction records for the entirety of 2023 and July till September 2024. Based on the requirements of this exercise, I have decided to focus my study on four-room flats.

resale <- read_csv("data/HDB/rawdata/resale.csv") %>%
  filter(month >= "2023-01" & month <= "2023-12" | month %in% c("2024-07", "2024-08", "2024-09")) %>%
  filter(flat_type == "4 ROOM")

resale data snipshot

The observations have been reduced to 14733 after reducing the area of study to only four-room flats.

The code chunk below serves the functions of combining block and street_name variables to create a new and more complete variable address (excluding unit number) alongside remaining_lease_yr and remaining_lease_mth extracted from remaining_lease. This function will supplement our steps later on in creating the model.

resale_tidy <- resale %>%
  mutate(address = paste(block,street_name)) %>%
  mutate(remaining_lease_yr = as.integer(
    str_sub(remaining_lease, 0, 2)))%>%
  mutate(remaining_lease_mth = as.integer(
    str_sub(remaining_lease, 9, 11)))

Due to the addresses having same street names, a list is created with unique addresses to reduce the number of records and for the list to be passed through for API ingestion in the following steps. The sort function is used to reorder the records so that records will be in order. In essence, the code chunk below sorts a list of unique addresses to avoid the issue of repeated geocoding.

add_list <- sort(unique(resale_tidy$address))

The following code chunks are used to obtain the postal code of the addresses using geocoding by passing the entries through the onemap API. This code chunk is credited to Prof. Kam where a HTTP address is sent to the OneMap API.

get_coords <- function(add_list){
  
  # Create a data frame to store all retrieved coordinates
  postal_coords <- data.frame()
    
  for (i in add_list){
    #print(i)

    r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
           query=list(searchVal=i,
                     returnGeom='Y',
                     getAddrDetails='Y'))
    data <- fromJSON(rawToChar(r$content))
    found <- data$found
    res <- data$results
    
    # Create a new data frame for each address
    new_row <- data.frame()
    
    # If single result, append 
    if (found == 1){
      postal <- res$POSTAL 
      lat <- res$LATITUDE
      lng <- res$LONGITUDE
      new_row <- data.frame(address= i, 
                            postal = postal, 
                            latitude = lat, 
                            longitude = lng)
    }
    
    # If multiple results, drop NIL and append top 1
    else if (found > 1){
      # Remove those with NIL as postal
      res_sub <- res[res$POSTAL != "NIL", ]
      
      # Set as NA first if no Postal
      if (nrow(res_sub) == 0) {
          new_row <- data.frame(address= i, 
                                postal = NA, 
                                latitude = NA, 
                                longitude = NA)
      }
      
      else{
        top1 <- head(res_sub, n = 1)
        postal <- top1$POSTAL 
        lat <- top1$LATITUDE
        lng <- top1$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
    }

    else {
      new_row <- data.frame(address= i, 
                            postal = NA, 
                            latitude = NA, 
                            longitude = NA)
    }
    
    # Add the row
    postal_coords <- rbind(postal_coords, new_row)
  }
  return(postal_coords)
}

Following which, the get_coords() function is used to crawl the coordinates from the generated list

coords <- get_coords(add_list)

The following code chunk will be used to save the results to avoid having to re-run the code chunks above which will take up additional time and resources.

write_rds(coords, "data/HDB/rds/coords.rds")
coords <- read_rds('data/HDB/rds/coords.rds')

3.1 Setting CRS for HDB Resale Data

The code chunk below first creates an sf object before the EPSG code is set for Singapore which is 3414

resale_tidy <- resale_tidy %>%
  left_join(coords, by = c("address" = "address")) %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)

write_rds(resale_tidy, "data/HDB/rds/resale.rds")
resale_tidy <- read_rds("data/HDB/rds/resale.rds")

3.2 Structural factors for HDB Resale units

  • Structural factors
    • Area of the unit
    • Floor level
    • Remaining lease
    • Age of the unit
    • Main Upgrading Program (MUP) completed (optional)

3.2.1 Floor Level

As one of the structural factors the code chunk below is used to view the floor levels in resale_tidy using the unique() function

unique(resale_tidy$storey_range)
 [1] "07 TO 09" "10 TO 12" "01 TO 03" "04 TO 06" "16 TO 18" "25 TO 27"
 [7] "13 TO 15" "22 TO 24" "19 TO 21" "28 TO 30" "34 TO 36" "43 TO 45"
[13] "31 TO 33" "46 TO 48" "40 TO 42" "37 TO 39" "49 TO 51"

As the variable for storey_range is in string, transformation is done to mutate it as numeric. However, the storey_range as stated follows a range which will make it hard for analysis as the exact floor level is not stated. Hence, this numeric attribute will be transformed based on the median value as the chosen method ensuring that we have values to work with instead of a range. The code transforms the storey_range variable from a string representing a range into a numeric variable by calculating the median floor level for each range, enabling easier analysis.

resale_tidy <- resale_tidy %>%
  mutate(
    level = (as.numeric(str_extract(storey_range, "^[0-9]+")) +
                  as.numeric(str_extract(storey_range, "[0-9]+$"))) / 2
  )

Code chunk below is used to get a summary if the transformation is done correctly

summary(resale_tidy$level)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   5.000   8.000   9.246  11.000  50.000 

3.2.2 Remaining Lease & Age of Unit

We will also do some data wrangling for remaining_lease_yr & remaining_lease_mth. The code processes the remaining_lease_yr and remaining_lease_mth variables to first handle missing values, calculate the remaining lease in decimal years, and derive the unit’s age based on a 99-year HDB lease. The original year and month variables are then removed for a cleaner dataset.

resale_tidy <- resale_tidy %>%
  mutate(

# Replace NA in months with 0 as observed in resale_tidy

remaining_lease_mth = if_else(is.na(remaining_lease_mth), 0, remaining_lease_mth),
    
# Calculate remaining lease in decimal years
  remaining_lease = remaining_lease_yr + (remaining_lease_mth / 12),
    
# Age of unit calculation based on a HDB 99-year lease
    unit_age = 99 - remaining_lease
  ) %>%
  relocate(remaining_lease_yr, remaining_lease_mth, .after = last_col())

4 Locational factors (Geospatial)

The following locational factors will be derived from their respective data sources such as from data.gov.sg for this exercise.

  • Locational factors
    • Proximity to CBD
    • Proximity to elder care
    • Proximity to hawker centres
    • Proximity to MRT
    • Proximity to park
    • Proximity to CHAS Clinics
    • Proximity to good primary school
    • Proximity to supermarket
    • Numbers of kindergartens within 350m
    • Numbers of childcare centres within 350m
    • Numbers of bus stop within 350m

Based on the locational factors above, we will import these geospatial data into the R environment. Firstly the Master Plan Subzone boundary data

mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL") %>%
  st_transform(3414)
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
mpsz
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
   OBJECTID SUBZONE_NO       SUBZONE_N SUBZONE_C CA_IND      PLN_AREA_N
1         1          1    MARINA SOUTH    MSSZ01      Y    MARINA SOUTH
2         2          1    PEARL'S HILL    OTSZ01      Y          OUTRAM
3         3          3       BOAT QUAY    SRSZ03      Y SINGAPORE RIVER
4         4          8  HENDERSON HILL    BMSZ08      N     BUKIT MERAH
5         5          3         REDHILL    BMSZ03      N     BUKIT MERAH
6         6          7  ALEXANDRA HILL    BMSZ07      N     BUKIT MERAH
7         7          9   BUKIT HO SWEE    BMSZ09      N     BUKIT MERAH
8         8          2     CLARKE QUAY    SRSZ02      Y SINGAPORE RIVER
9         9         13 PASIR PANJANG 1    QTSZ13      N      QUEENSTOWN
10       10          7       QUEENSWAY    QTSZ07      N      QUEENSTOWN
   PLN_AREA_C       REGION_N REGION_C          INC_CRC FMEL_UPD_D   X_ADDR
1          MS CENTRAL REGION       CR 5ED7EB253F99252E 2014-12-05 31595.84
2          OT CENTRAL REGION       CR 8C7149B9EB32EEFC 2014-12-05 28679.06
3          SR CENTRAL REGION       CR C35FEFF02B13E0E5 2014-12-05 29654.96
4          BM CENTRAL REGION       CR 3775D82C5DDBEFBD 2014-12-05 26782.83
5          BM CENTRAL REGION       CR 85D9ABEF0A40678F 2014-12-05 26201.96
6          BM CENTRAL REGION       CR 9D286521EF5E3B59 2014-12-05 25358.82
7          BM CENTRAL REGION       CR 7839A8577144EFE2 2014-12-05 27680.06
8          SR CENTRAL REGION       CR 48661DC0FBA09F7A 2014-12-05 29253.21
9          QT CENTRAL REGION       CR 1F721290C421BFAB 2014-12-05 22077.34
10         QT CENTRAL REGION       CR 3580D2AFFBEE914C 2014-12-05 24168.31
     Y_ADDR SHAPE_Leng SHAPE_Area                       geometry
1  29220.19   5267.381  1630379.3 MULTIPOLYGON (((31495.56 30...
2  29782.05   3506.107   559816.2 MULTIPOLYGON (((29092.28 30...
3  29974.66   1740.926   160807.5 MULTIPOLYGON (((29932.33 29...
4  29933.77   3313.625   595428.9 MULTIPOLYGON (((27131.28 30...
5  30005.70   2825.594   387429.4 MULTIPOLYGON (((26451.03 30...
6  29991.38   4428.913  1030378.8 MULTIPOLYGON (((25899.7 297...
7  30230.86   3275.312   551732.0 MULTIPOLYGON (((27746.95 30...
8  30222.86   2208.619   290184.7 MULTIPOLYGON (((29351.26 29...
9  29893.78   6571.323  1084792.3 MULTIPOLYGON (((20996.49 30...
10 30104.18   3454.239   631644.3 MULTIPOLYGON (((24472.11 29...

The extent of mpsz is shown by using st_bbox() of sf package.

st_bbox(mpsz) #view extent
     xmin      ymin      xmax      ymax 
 2667.538 15748.721 56396.440 50256.334 

4.1 Eldercare

eldercare <- st_read(dsn = "data/geospatial", layer = "ELDERCARE") %>%
  st_transform(3414)
Reading layer `ELDERCARE' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21

4.2 CHAS Clinics

chas <- st_read("data/geospatial/CHASClinics.kml") %>%
  st_transform(crs = 3414)
Reading layer `MOH_CHAS_CLINICS' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\CHASClinics.kml' 
  using driver `KML'
Simple feature collection with 1193 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.5818 ymin: 1.016264 xmax: 103.9903 ymax: 1.456037
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.3 Childcare Centres

childcare <- st_read("data/geospatial/ChildCareServices.kml") %>%
  st_transform(crs = 3414)
Reading layer `CHILDCARE' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\ChildCareServices.kml' 
  using driver `KML'
Simple feature collection with 1925 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6878 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.4 Kindergartens

kindergartens <- st_read("data/geospatial/Kindergartens.geojson") %>%
  st_transform(crs = 3414)
Reading layer `Kindergartens' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\Kindergartens.geojson' 
  using driver `GeoJSON'
Simple feature collection with 448 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.5 Parks

parks <- st_read("data/geospatial/Parks.geojson") %>%
  st_transform(crs = 3414)
Reading layer `Parks' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\Parks.geojson' 
  using driver `GeoJSON'
Simple feature collection with 430 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.6 Hawker Centres

hawker_centre <- st_read("data/geospatial/HawkerCentresGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `HawkerCentresGEOJSON' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\HawkerCentresGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 125 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.7 Supermarkets

supermarkets <- st_read("data/geospatial/SupermarketsGEOJSON.geojson") %>%
  st_transform(crs = 3414)
Reading layer `SupermarketsGEOJSON' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial\SupermarketsGEOJSON.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84

4.8 Bus Stop Locations

bus_stops <- st_read(dsn = "data/geospatial", layer = "BusStop") %>%
  st_transform(crs = 3414) %>%
  filter(lengths(st_within(., mpsz)) > 0)
Reading layer `BusStop' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5166 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48285.52 ymax: 52983.82
Projected CRS: SVY21

4.9 MRT Accessibility

MRT <- st_read(dsn = "data/geospatial", layer = "RapidTransitSystemStation") %>%
  st_transform(crs = 3414)
Reading layer `RapidTransitSystemStation' from data source 
  `C:\zjho008\ISSS626-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 230 features and 5 fields (with 1 geometry empty)
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 6068.209 ymin: 27478.44 xmax: 45377.5 ymax: 47913.58
Projected CRS: SVY21
Sys.setenv(OGR_GEOMETRY_ACCEPT_UNCLOSED_RING = "NO")

MRT <- MRT[!st_is_empty(MRT), ]

# Convert Polygon to Point
MRT <- st_centroid(MRT)

5 Processing Geospatial Data

In the previous section, we have loaded the geospatial data of interest it was also observed that some of this data consisted of the Z dimension. We will proceed to remove them as well as drop and unnecessary columns to reduce computation time and ensure geometries are valid.

chas <- st_zm(chas)
childcare <- st_zm(childcare)
kindergartens <- st_zm(kindergartens)
parks <- st_zm(parks)
hawker_centre <- st_zm(hawker_centre)
supermarkets <- st_zm(supermarkets)

The code chunks below are used to remove columns not exactly needed to do analysis as the needed variables are generally the Name for identification and Geometry variables. Note that not all the columns selected are Column 1.

eldercare <- eldercare %>%
  select(c(1))

chas <- chas %>%
  select(c(1))

childcare <- childcare %>%
  select(c(1))

kindergartens <- kindergartens %>%
  select(c(1))

parks <- parks %>%
  select(c(1))

hawker_centre <- hawker_centre %>%
  select(c(1))

supermarkets <- supermarkets %>%
  select(c(1))

bus_stops <- bus_stops %>%
  select(c(1))

MRT <- MRT %>%
  select(c(5))

The code chunk below is used to ensure geometries are valid.

length(which(st_is_valid(mpsz) == FALSE))
[1] 9
length(which(st_is_valid(eldercare) == FALSE))
[1] 0
length(which(st_is_valid(chas) == FALSE))
[1] 0
length(which(st_is_valid(childcare) == FALSE))
[1] 0
length(which(st_is_valid(kindergartens) == FALSE))
[1] 0
length(which(st_is_valid(parks) == FALSE))
[1] 0
length(which(st_is_valid(hawker_centre) == FALSE))
[1] 0
length(which(st_is_valid(supermarkets) == FALSE))
[1] 0
length(which(st_is_valid(bus_stops) == FALSE))
[1] 0
length(which(st_is_valid(MRT) == FALSE))
[1] 0

mpsz has invalid 9 invalid geometries which we will fix using st_make_valid()

mpsz <- st_make_valid(mpsz)
length(which(st_is_valid(mpsz) == FALSE))
[1] 0

The code chunk is ran again to do checks ensuring valid geometries

length(which(st_is_valid(mpsz) == FALSE))
[1] 0
length(which(st_is_valid(eldercare) == FALSE))
[1] 0
length(which(st_is_valid(chas) == FALSE))
[1] 0
length(which(st_is_valid(childcare) == FALSE))
[1] 0
length(which(st_is_valid(kindergartens) == FALSE))
[1] 0
length(which(st_is_valid(parks) == FALSE))
[1] 0
length(which(st_is_valid(hawker_centre) == FALSE))
[1] 0
length(which(st_is_valid(supermarkets) == FALSE))
[1] 0
length(which(st_is_valid(bus_stops) == FALSE))
[1] 0
length(which(st_is_valid(MRT) == FALSE))
[1] 0

We can see that they are all fixed.

6 Visualising the Data

In this section we will do quick visualisations without much customisations done just to ensure that the data is appropriate before proceeding.

tm_shape(mpsz) +
  tm_polygons(col = "grey")

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(eldercare) +
  tm_dots(col = "red", size = 0.1)

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(chas) +
  tm_dots(col = "red")

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(childcare) +
  tm_dots(col = "red")

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(kindergartens) +
  tm_dots(col = "red", size = 0.08)

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(parks) +
  tm_dots(col = "red", size = 0.08)

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(hawker_centre) +
  tm_dots(col = "red", size = 0.1)

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(supermarkets) +
  tm_dots(col = "red", size = 0.08)

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(bus_stops) +
  tm_dots(col = "red", size = 0.005)

tm_shape(mpsz) +
  tm_polygons(col = "grey") +
  tm_shape(MRT) +
  tm_dots(col = "red", size = 0.1)

Based on the above visualisations the data points look in place without any abnormalities.

7 Locational Factors (Proximity Calculation)

Now to calculate the proximity of HDB flats to relevant facilities. The provided proximity function - credits streamlines this process by calculating the minimum distance from each feature in df1 to the nearest feature in df2 and assigning this distance to a new column specified by varname.

The following locational factors will be calculated in terms of proximity:

-   Proximity to CBD (Raffles Place, Tanjong Pagar MRT & City Hall MRT)
-   Proximity to eldercare
-   Proximity to CHAS Clinics
-   Proximity to supermarket
-   Proximity to hawker centres
-   Proximity to supermarket
-   Proximity to MRT
-   Numbers of kindergartens within 350m
-   Numbers of childcare centres within 350m
-   Numbers of bus stop within 350m
proximity <- function(df1, df2, varname) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units()
  
  # Calculate minimum distance for each row
  df1[,varname] <- rowMins(dist_matrix)
  
  return(df1)
}
cbd <- filter(MRT, STN_NAM_DE %in% c("RAFFLES PLACE MRT STATION", "TANJONG PAGAR MRT STATION", "CITY HALL MRT STATION"))
resale_tidy <- proximity(resale_tidy, cbd, "PROX_CBD") %>%
  proximity(., eldercare, "PROX_ELDERCARE") %>%
  proximity(., chas, "PROX_CHAS") %>%
  proximity(., hawker_centre, "PROX_HAWKER") %>%
  proximity(., MRT, "PROX_MRT") %>%
  proximity(., parks, "PROX_PARK") %>%
  proximity(., childcare, "PROX_CHILDCARE") %>%
  proximity(., kindergartens, "PROX_KINDERGARTEN") %>%
  proximity(., supermarkets, "PROX_SUPERMARKET") %>%
  proximity(., bus_stops, "PROX_BUS_STOP")

We also need to calculate the number of facilities within a specific radius from the resale flats. The count_in_radius function accomplishes this by calculating the distance matrix between df1 and df2 using st_distance, identifying features within the specified radius, and summing these counts in a new column in df1 designated by varname.

count_in_radius <- function(df1, df2, varname, radius) {
  dist_matrix <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  df1[,varname] <- rowSums(dist_matrix <= radius)
  return(df1)
}

The code chunk below is used to calculate the locational factors within a 350m radius

resale_tidy <- count_in_radius(resale_tidy, kindergartens, "NUM_KINDERGARTEN", 350) %>%
  count_in_radius(., childcare, "NUM_CHILDCARE", 350) %>%
  count_in_radius(., bus_stops, "NUM_BUS_STOP", 350) %>%
  count_in_radius(., chas, "NUM_CHAS", 350)

7.1 Sampling Size and st_jitter() for overlapping points

Calibrating predictive models are computationally intensive, especially when the random forest method is used. For quick prototyping, a smaller sample will be selected at random from the data by using the code chunk below.

resale_tidy consists of 14733 observations and 23 variables.

A sample of 5000 will be used.

set.seed(1234)

resale_tidy <- resale_tidy %>%
  sample_n(5000)

The code chunk below is used to check if there are overlapping point features.

overlapping_points <- resale_tidy %>%
  mutate(overlap = lengths(st_equals(., .)) > 1)

Based on the results, it is observed that there are already multiple overlaps hence, in the code code chunk below, st_jitter() of sf package is used to move the point features by 3 metres to avoid overlapping point features.

resale_tidy <- resale_tidy %>%
  st_jitter(amount = 3)

The code chunk is run again to check for overlapping points.

overlapping_points <- resale_tidy %>%
  mutate(overlap = lengths(st_equals(., .)) > 1)

The results now show that there are no overlapping point features with no results shown when TRUE is selected for the overlap variable.

Note

When using GWmodel to calibrate explanatory or predictive models, it is very important to ensure that there are no overlapping point features

write_rds(resale_tidy, "data/HDB/rds/int_resale.rds")
resale_tidy <- read_rds("data/HDB/rds/int_resale.rds")

Columns which are not needed for this analysis in this exercise will be dropped and the file is then saved as a RDS file for ease of retrieval without the need to run the above code chunks again.

resale_tidy <- resale_tidy %>%
  rename(
    MONTH = month,
    TOWN = town,
    FLOOR_AREA_SQM = floor_area_sqm,
    ADDRESS = address,
    RESALE_PRICE = resale_price,
    LEVEL = level,
    REMAINING_LEASE = remaining_lease,
    UNIT_AGE = unit_age
  ) %>%
  select(MONTH, TOWN, ADDRESS, REMAINING_LEASE, FLOOR_AREA_SQM, UNIT_AGE, RESALE_PRICE, LEVEL,
         PROX_CBD, PROX_ELDERCARE, PROX_CHAS, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_CHILDCARE,
         PROX_KINDERGARTEN, PROX_SUPERMARKET, PROX_BUS_STOP, NUM_KINDERGARTEN, NUM_CHILDCARE, NUM_BUS_STOP, NUM_CHAS)

write_rds(resale_tidy, "data/HDB/rds/final_resale.rds")

Code chunk below is used for ease of retrieving the finalised resale_tidy data table

resale_tidy <- read_rds("data/HDB/rds/final_resale.rds")

8 Exploratory Data Analysis (EDA)

In the section, it will make use of statistical graphics functions of ggplot2 package to perform EDA on resale_tidy

8.1 EDA using statistical graphics

8.1.1 Resale Price

ggplot(data = resale_tidy, aes(x = `RESALE_PRICE`)) +
  geom_histogram(bins = 20, color = "black", fill = "salmon")

The figure above reveals a right skewed distribution. This means that more 4-room resale units were transacted at relative lower prices ranging from $400,000 to $600,000 price range.

8.1.2 Floor Area (SQM)

ggplot(data = resale_tidy, aes(x = `FLOOR_AREA_SQM`)) +
  geom_histogram(bins = 20, color = "black", fill = "salmon")

In terms of floor area, the figure above reveals a right skewed distribution. This means that 4-room resale units were within the 90 to 100 sqm floor area.

8.1.3 Multiple Histogram Plots for Locational Factors

The code chunk below is used to create 12 histograms. Then, ggarrange() is used to organised the histograms into a 3 column by 4 row small multiple plot.

AREA_SQM <- ggplot(data = resale_tidy, aes(x= `FLOOR_AREA_SQM`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

AGE <- ggplot(data = resale_tidy, aes(x= `UNIT_AGE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

REMAINING_LEASE <- ggplot(data = resale_tidy, aes(x= `REMAINING_LEASE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CBD <- ggplot(data = resale_tidy, aes(x= `PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CHILDCARE <- ggplot(data = resale_tidy, aes(x= `PROX_CHILDCARE`)) + 
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_ELDERCARE <- ggplot(data = resale_tidy, aes(x= `PROX_ELDERCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER <- ggplot(data = resale_tidy, aes(x = `PROX_HAWKER`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_KINDERGARTEN <- ggplot(data = resale_tidy, aes(x = `PROX_KINDERGARTEN`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data = resale_tidy, aes(x= `PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data = resale_tidy, aes(x= `PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_BUS_STOP <- ggplot(data = resale_tidy, aes(x= `PROX_BUS_STOP`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_SUPERMARKET <- ggplot(data = resale_tidy, 
                               aes(x= `PROX_SUPERMARKET`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(AREA_SQM, AGE, REMAINING_LEASE, PROX_CBD, PROX_CHILDCARE,
          PROX_ELDERCARE, PROX_HAWKER, PROX_KINDERGARTEN, PROX_MRT,
          PROX_PARK, PROX_BUS_STOP, PROX_SUPERMARKET,  
          ncol = 3, nrow = 4)

8.2 Statistical Point Map

We can also reveal the geospatial pattern distribution of four-room HDB resale prices and unit age sold in Singapore. The map will be prepared by using tmap package.

First, we will turn on the interactive mode of tmap by using the code chunk below.

tmap_mode("view")
tm_shape(mpsz)+
  tm_polygons() +
tm_shape(resale_tidy) +  
  tm_dots(col = "RESALE_PRICE",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

set.zoom.limits argument of tm_view() sets the minimum and maximum zoom level to 11 and 14 respectively.

tm_shape(mpsz)+
  tm_polygons() +
tm_shape(resale_tidy) +  
  tm_dots(col = "UNIT_AGE",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

Before moving on to the next section, the code below will be used to turn R display into plot mode. This also helps to reduce rendering time.

tmap_mode("plot")

9 Computing Correlation Matrix

Before loading the predictors into a predictive model, it is always a good practice to use correlation matrix to examine if there is sign of multicollinearity.

resale_tidy_nogeo <- resale_tidy %>%
  st_drop_geometry()
corrplot::corrplot(cor(resale_tidy_nogeo[, 4:18]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

The correlation matrix above shows that all the correlation values are below 0.8 except UNIT_AGE and REMAINING_LEASE which is at -1.0 given that they are highly related since UNIT_AGE was derived with the 99 year lease deducted by REMAINING_LEASE. Hence, one of this variables will be dropped to ensure there is no high multicollinearity in the model.

The code chunk below uses ggcorrmat() function to compute the correlation matrix.

Note that UNIT_AGE is not selected as it will be excluded in the subsequent model building section.

resale_tidy_nogeo <- resale_tidy %>%
  st_drop_geometry()
ggstatsplot::ggcorrmat(resale_tidy_nogeo[, c(4:5, 7:18)])

10 Model Building

10.1 Data Sampling

The entire data is split into training and test data sets with resale transaction records in 2023 as the training data and July-September (Q3) 2024 resale transaction records respectively.

If the data is in a one year period and training and test data has to be split the it can be done by using initial_split() of rsample package. rsample is one of the package of tidymodels.

set.seed(1234)

resale_train <- resale_tidy %>%
  filter(str_sub(MONTH, 1, 4) == "2023")

resale_test <- resale_tidy %>%
  filter(str_sub(MONTH, 1, 4) == "2024")

10.2 Building a Multiple Linear Regression

The code chunk below uses lm() to calibrate the multiple linear regression model.

resale.mlr <- lm(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
                     PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
                     PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
                     NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS, 
                data = resale_train)
ols_regress(resale.mlr)
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.884       RMSE                    61822.812 
R-Squared                   0.781       MSE                3840136076.103 
Adj. R-Squared              0.780       Coef. Var                  10.638 
Pred R-Squared              0.778       AIC                     95262.990 
MAE                     47560.585       SBC                     95381.722 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                    ANOVA                                      
------------------------------------------------------------------------------
                    Sum of                                                    
                   Squares          DF       Mean Square       F         Sig. 
------------------------------------------------------------------------------
Regression    5.205812e+13          17      3.062242e+12    797.431    0.0000 
Residual      1.461556e+13        3806    3840136076.103                      
Total         6.667368e+13        3823                                        
------------------------------------------------------------------------------

                                           Parameter Estimates                                             
----------------------------------------------------------------------------------------------------------
            model          Beta    Std. Error    Std. Beta       t        Sig          lower        upper 
----------------------------------------------------------------------------------------------------------
      (Intercept)    -42061.163     18317.692                  -2.296    0.022    -77974.600    -6147.725 
   FLOOR_AREA_SQM      4229.999       155.882        0.219     27.136    0.000      3924.379     4535.619 
            LEVEL      5817.233       172.053        0.285     33.811    0.000      5479.908     6154.558 
  REMAINING_LEASE      5124.646        84.869        0.541     60.383    0.000      4958.253     5291.040 
         PROX_CBD       -16.801         0.323       -0.534    -51.996    0.000       -17.435      -16.168 
   PROX_ELDERCARE        -4.868         1.767       -0.023     -2.755    0.006        -8.332       -1.404 
        PROX_CHAS        68.923        11.149        0.054      6.182    0.000        47.066       90.781 
      PROX_HAWKER       -17.655         2.145       -0.069     -8.231    0.000       -21.860      -13.449 
         PROX_MRT       -29.076         2.910       -0.079     -9.991    0.000       -34.782      -23.370 
        PROX_PARK        -8.359         2.517       -0.028     -3.322    0.001       -13.293       -3.425 
   PROX_CHILDCARE        14.435        15.630        0.007      0.924    0.356       -16.209       45.079 
PROX_KINDERGARTEN       -36.522         7.372       -0.049     -4.954    0.000       -50.975      -22.070 
 PROX_SUPERMARKET        -2.209         7.360       -0.003     -0.300    0.764       -16.639       12.221 
    PROX_BUS_STOP        15.268        20.027        0.006      0.762    0.446       -23.996       54.531 
 NUM_KINDERGARTEN      4192.419      1337.767        0.033      3.134    0.002      1569.611     6815.227 
    NUM_CHILDCARE     -3376.015       533.873       -0.059     -6.324    0.000     -4422.719    -2329.311 
     NUM_BUS_STOP       287.083       401.725        0.006      0.715    0.475      -500.534     1074.699 
         NUM_CHAS      7300.079       527.443        0.125     13.841    0.000      6265.981     8334.177 
----------------------------------------------------------------------------------------------------------
Result Explanation

Significant Predictors of Resale Price:

  1. Floor Area (FLOOR_AREA_SQM):

With a positive and highly significant coefficient (p < 0.001), floor area strongly predicts resale prices. This aligns with common knowledge that larger HDB flats typically command higher resale values due to the increased living space.

  1. Level:

The level (floor) of the unit also shows a strong positive effect on resale prices (p < 0.001). Higher floors are often associated with better views, increased privacy, and less noise, which generally adds value.

  1. Remaining Lease:

A positive coefficient for remaining lease (p < 0.001) suggests that resale prices tend to be higher for units with more years left on their lease, as these units offer longer occupancy potential before reaching the lease expiration.

  1. Proximity to MRT (PROX_MRT):

While proximity to the MRT has a significant negative coefficient, indicating that being closer to MRT stations tends to increase prices, the distance is inversely coded (i.e., closer proximity lowers distance values). This is consistent with the desirability of accessible public transport, especially in Singapore.

  1. Proximity to Childcare (PROX_CHILDCARE) and Kindergartens (PROX_KINDERGARTEN):

Both proximity to childcare and kindergartens are significant (p < 0.01), with negative coefficients, suggesting that families especially those with children value these amenities nearby.

  1. Other Amenities:

Several other amenities like Proximity to Parks (PROX_PARK) and Supermarkets (PROX_SUPERMARKET) are also significant, though their impact on resale prices is smaller compared to factors like floor area and level.

Insignificant Predictors:

  • Proximity to Eldercare (PROX_ELDERCARE) and Proximity to CHAS Clinics (PROX_CHAS):

These variables have relatively high p-values (p > 0.05), indicating that they may not be significant predictors of resale price in this model. Their lack of significance could suggest that proximity to eldercare and CHAS clinics does not strongly impact the value of HDB resale flats, possibly due to varying demand based on demographics.

  • Proximity to Hawker Centres (PROX_HAWKER):

Although hawker centres are important to residents for affordable food, this variable does not appear to be highly significant in affecting resale prices (p > 0.05), perhaps because it is less exclusive or ubiquitous.

Model Accuracy and Fit:

  • R-Squared: The model explains about 77.9% of the variability in resale prices, which is a strong indication of model fit and suggests that the included predictors capture the majority of factors influencing resale prices.

  • RMSE (Root Mean Square Error): The RMSE value of 63051.285 suggests the typical deviation between observed and predicted resale prices. While a lower RMSE would indicate better prediction, this value suggests a reasonable model fit given the complexities of housing prices.

  • F-statistic: The model’s F-statistic is highly significant (p < 0.001), supporting the overall model validity.

Observations:

  • Model Coefficients: The coefficients align with common real estate trends, with larger and higher-floor units being more valuable. Proximity to transport and family-oriented amenities also enhances resale value, reflecting a preference for accessible and family-friendly environments in Singapore.

  • Possible Multicollinearity: Given the number of variables, there may be multicollinearity among location-based predictors (e.g., proximity to MRT, parks, childcare). This can sometimes inflate standard errors and affect the interpretability of each predictor. Further diagnostics, such as Variance Inflation Factor (VIF), could help in evaluating multicollinearity which is done in the next step.

10.2.1 Checking for multicolinearity

In this section, it will introduce you an R package specially programmed for performing OLS regression. It is called olsrr. It provides a collection of very useful methods for building better multiple linear regression models:

  • Comprehensive Regression Output
  • Variable Selection Procedures
  • Heteroskedasticity Tests
  • Collinearity Diagnostics
  • Model Fit Assessment
  • Measures of Influence
  • Residual Diagnostics
  • Variable Contribution Assessment

In the code chunk below, the ols_vif_tol() of olsrr package is used to test if there are signs of multicollinearity.

ols_vif_tol(resale.mlr)
           Variables Tolerance      VIF
1     FLOOR_AREA_SQM 0.8827152 1.132868
2              LEVEL 0.8133391 1.229499
3    REMAINING_LEASE 0.7187548 1.391295
4           PROX_CBD 0.5466303 1.829390
5     PROX_ELDERCARE 0.8121368 1.231320
6          PROX_CHAS 0.7485047 1.335997
7        PROX_HAWKER 0.8146946 1.227454
8           PROX_MRT 0.9323147 1.072599
9          PROX_PARK 0.7834653 1.276381
10    PROX_CHILDCARE 0.9057032 1.104114
11 PROX_KINDERGARTEN 0.5976016 1.673356
12  PROX_SUPERMARKET 0.7885493 1.268151
13     PROX_BUS_STOP 0.8668546 1.153596
14  NUM_KINDERGARTEN 0.5250241 1.904675
15     NUM_CHILDCARE 0.6711662 1.489944
16      NUM_BUS_STOP 0.7718257 1.295629
17          NUM_CHAS 0.7061126 1.416205
Tip
  • 0 to 5: variables are not correlated
  • 5 to 10: variables are correlated
  • Greater than 10: variables are highly correlated

Based on the results of the Variance Inflation Factors (VIF) none of the variables are greater than 5. Each of the independent variables are calculated with another independent variable to attain the values above.

The analysis shows that multicollinearity is not a concern in this model, as all variables have VIF values significantly below the commonly accepted threshold. This suggests that each predictor contributes independently to the model, without redundancy that would impair interpretation or predictive power.

This shows no need to eliminate the variables.

note that there could be binary variables in datasets like Y/N options (dummy variables) which have some signs of correlation which are from the variable of lease properties: LEASEHOLD_99YR vs FREEHOLD etc.

10.2.2 Test for Non-Linearity

In multiple linear regression, it is important for us to test the assumption that linearity and additivity of the relationship between dependent and independent

In the code chunk below, the ols_plot_resid_fit() of olsrr package is used to perform linearity assumption test.

ols_plot_resid_fit(resale.mlr)

Result Explanation

While most of the data points are scattered around the 0 line; the plot also indicates slight issues with both linearity and homoscedasticity. The widening funnel shape at higher fitted values suggests that the model may not fully capture the complexity of higher-priced units, potentially indicating a need for transformation (e.g., log-transformation) or a different modelling approach to improve fit.

10.2.3 Test for Normality Assumption

Lastly, the code chunk below uses ols_plot_resid_hist() of olsrr package to perform normality assumption test.

ols_plot_resid_hist(resale.mlr)

Result Explanation

The residuals do not follow a perfectly normal distribution, as evidenced by the slightly skewed shape while it has a normal curve. This could impact the reliability of significance tests and confidence intervals, especially for large residual values. This suggests that the model might tend to underpredict for certain observations, leading to larger residuals on the positive side.

10.3 Test for Spatial Autocorrelation

The hedonic model we are constructing incorporates geographically referenced attributes, making it essential to visualize the residuals of this pricing model.

To test for spatial autocorrelation, we need to convert condo_resale.sf from an sf dataframe into a SpatialPointsDataFrame.

First, we will extract the residuals from the hedonic pricing model and save them as a new dataframe.

mlr_res <- as.data.frame(resale.mlr$residuals)

# joining the newly created data frame with resale_train

resale_res <- cbind(resale_train,
                    mlr_res) %>%
  rename(MLR_RES = resale.mlr.residuals)

In this step it will convert resale.res from simple feature object into a SpatialPointsDataFrame because spdep package can only process sp conformed spatial data objects.

The code chunk below will be used to perform the data conversion process.

resale_sp <- as_Spatial(resale_res)
resale_sp
class       : SpatialPointsDataFrame 
features    : 3824 
extent      : 11656.47, 42446.44, 28217.48, 48684.45  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 23
names       :   MONTH,       TOWN,          ADDRESS,  REMAINING_LEASE, FLOOR_AREA_SQM,         UNIT_AGE, RESALE_PRICE, LEVEL,         PROX_CBD,       PROX_ELDERCARE,            PROX_CHAS,      PROX_HAWKER,         PROX_MRT,        PROX_PARK,       PROX_CHILDCARE, ... 
min values  : 2023-01, ANG MO KIO,  10 CHAI CHEE RD,               46,             74, 3.66666666666667,       350000,     2, 311.741838971365, 0.000428110190953706, 7.28527615277263e-06, 39.5321340230547, 21.3638644944817, 77.0078791420734, 6.59828462646688e-05, ... 
max values  : 2023-12,     YISHUN, 9A BOON TIONG RD, 95.3333333333333,            176,               53,      1500000,    47, 19126.5211737734,     3261.54359027314,     808.332738794272, 2777.29424402156, 2152.48536476722, 2398.40352003288,     547.386819517238, ... 
write_rds(resale_sp, "data/HDB/rds/resale_sp.rds")
resale_sp <- read_rds("data/HDB/rds/resale_sp.rds")

Following which, tmap package will be used to display the distribution of the residuals on an interactive map.

The code churn below will turn on the interactive mode of tmap.

tmap_mode("view")
tm_shape(mpsz) +
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(col = "grey",
              alpha = 0.4) +
tm_shape(resale_res) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style = "quantile") +
  tm_layout(main.title = "Residuals Distribution",     
            main.title.position = "center",
            main.title.size = 0.8) +
  tm_view(set.zoom.limits = c(11,14))

Switching back to “plot” mode before continuing.

tmap_mode("plot")

Based on the figure above it reveals that there are signs of spatial autocorrelation.

To proof that our observation is indeed true, the Moran’s I test will be performed. First, to compute the distance-based weight matrix by using dnearneigh() function of spdep.

nb <- dnearneigh(coordinates(resale_sp), 0, 2000, longlat = FALSE)
summary(nb)
Neighbour list object:
Number of regions: 3824 
Number of nonzero links: 874036 
Percentage nonzero weights: 5.977142 
Average number of links: 228.5659 
2 disjoint connected subgraphs
Link number distribution:

  6  11  12  13  14  17  19  20  23  27  28  29  34  35  36  39  40  41  42  43 
  1   5   1   3   1   1   1   5   2   1   1   1   2   5   1   2   7   3   8   8 
 44  45  46  47  48  49  51  52  53  54  55  56  57  58  59  60  61  62  63  64 
  2   1   1   1   4  13   2   4   2   7   1   3   1  14   8   5   8   2  12  14 
 65  66  67  68  69  70  71  72  73  74  76  77  78  79  80  81  82  83  84  85 
 36  13  13   9   3  15   6   8   6   4   6  11  10   6  13   6   4  17  16  13 
 86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 
 14   8  12  14  17  19  12  17  21  10   9  14   4   7  12  10   4   6   3   7 
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 
  8  15  13  19   6   7  11  12   6  14   5  13  12  23  20  20  30  14  11  11 
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 
  8  14  14  13   7  17  18  16   9  18  17  16  10  14  20  18  26  28  23  21 
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 
 15  21  16  21   8  20  14  12  16  10  19  23   5  23  19  22  16  10   8  10 
166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 
  8  12   6  10  13  10   8   9   7  10   4   5   7   9   4  13   8   3   7   7 
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 
  2   8  15  10  11  13  14  12  14  10   6   9  19  11   6  26  10  13  10  10 
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 
 12   7  10  12  16  14  12   9  10   3   8   4   4   5  10   7   4   8   6  15 
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 
 15  10  17   2   7   5   8  14   6   7   8  11  10  14  12   9  11  24  11  13 
247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 
 10   7   9  11  11  10   9   4   9  10  12   9   9  18   8  16  16   4   9  21 
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 
 14  25  26  15  14  20  28  19  12  13  18   7  12  18  14   9  33   6   5  17 
287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 
 12   8  16  10   7   7  17  14  11  17  11  14   4   4   2   4   7   4  12   6 
307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 
  5   5   1   6   7   8   7   9   2   5   4   2   9   2   4   9  14  17   4   9 
327 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 344 345 347 348 
  2  13  14   3   3   3   4   3   5   5   2   4   2   2   3   1   6   1   2   2 
349 350 352 353 354 355 356 357 358 360 361 362 364 365 366 367 368 371 372 373 
  3   1   3   1   1   3   2   8   3   2   2   2   2   1   3   1   3   8   3   7 
374 375 376 377 378 379 380 381 382 383 384 386 387 388 389 390 391 392 394 395 
  3   7   4   6  12   8   3   3   4   4   4   9   1   6   2   1   4   2   1   8 
396 400 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 418 420 421 
  4   5   3   1   3   1   4   4   4   6  10   5   1   3   1   6   3   3   3   3 
423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 439 441 442 443 444 
  6   2   3   9   3   8   1   4   5   7   3   2   3   1   2   4   3   4   6   4 
446 448 449 451 452 455 456 457 459 460 461 462 463 465 466 467 469 470 471 472 
  3   5   2   3   4   3   3   2   4   6   5   3   3   2   3   2   1   1   5   1 
473 474 475 476 477 479 481 482 484 486 487 488 489 491 492 493 494 495 496 497 
  4   5   1   3   1   3   5   3   1   3   5   5   8   5   1   5   2   5   2   3 
498 500 505 506 508 509 510 511 512 513 514 515 516 518 519 520 521 522 523 526 
  2   6   1   2   5   1   2   4   2   5   3   2   5   2   4   3   1   2   3   3 
527 528 529 530 531 534 535 536 538 539 540 541 542 543 544 545 546 548 549 550 
  3   5   1   1   1   1   4   1   2   1   4   4   3   1   3   6   4   2   1   1 
551 553 554 555 556 557 560 561 562 563 564 566 567 569 570 571 572 573 576 579 
  1   1   3   5   2   3   2   2   6   5   2   7   2   1   3   4   2   2   1   2 
581 593 594 595 598 601 604 
  2   2   1   1   1   2   1 
1 least connected region:
1732 with 6 links
1 most connected region:
3263 with 604 links

Following which, nb2listw() of spdep package will be used to convert the output neighbours lists (i.e. nb) into spatial weights.

nb_lw <- nb2listw(nb, style = 'W')
summary(nb_lw)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 3824 
Number of nonzero links: 874036 
Percentage nonzero weights: 5.977142 
Average number of links: 228.5659 
2 disjoint connected subgraphs
Link number distribution:

  6  11  12  13  14  17  19  20  23  27  28  29  34  35  36  39  40  41  42  43 
  1   5   1   3   1   1   1   5   2   1   1   1   2   5   1   2   7   3   8   8 
 44  45  46  47  48  49  51  52  53  54  55  56  57  58  59  60  61  62  63  64 
  2   1   1   1   4  13   2   4   2   7   1   3   1  14   8   5   8   2  12  14 
 65  66  67  68  69  70  71  72  73  74  76  77  78  79  80  81  82  83  84  85 
 36  13  13   9   3  15   6   8   6   4   6  11  10   6  13   6   4  17  16  13 
 86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 
 14   8  12  14  17  19  12  17  21  10   9  14   4   7  12  10   4   6   3   7 
106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 
  8  15  13  19   6   7  11  12   6  14   5  13  12  23  20  20  30  14  11  11 
126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 
  8  14  14  13   7  17  18  16   9  18  17  16  10  14  20  18  26  28  23  21 
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 
 15  21  16  21   8  20  14  12  16  10  19  23   5  23  19  22  16  10   8  10 
166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 
  8  12   6  10  13  10   8   9   7  10   4   5   7   9   4  13   8   3   7   7 
187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 
  2   8  15  10  11  13  14  12  14  10   6   9  19  11   6  26  10  13  10  10 
207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 
 12   7  10  12  16  14  12   9  10   3   8   4   4   5  10   7   4   8   6  15 
227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 
 15  10  17   2   7   5   8  14   6   7   8  11  10  14  12   9  11  24  11  13 
247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 
 10   7   9  11  11  10   9   4   9  10  12   9   9  18   8  16  16   4   9  21 
267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 
 14  25  26  15  14  20  28  19  12  13  18   7  12  18  14   9  33   6   5  17 
287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 
 12   8  16  10   7   7  17  14  11  17  11  14   4   4   2   4   7   4  12   6 
307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 
  5   5   1   6   7   8   7   9   2   5   4   2   9   2   4   9  14  17   4   9 
327 328 329 330 331 332 333 335 336 337 338 339 340 341 342 343 344 345 347 348 
  2  13  14   3   3   3   4   3   5   5   2   4   2   2   3   1   6   1   2   2 
349 350 352 353 354 355 356 357 358 360 361 362 364 365 366 367 368 371 372 373 
  3   1   3   1   1   3   2   8   3   2   2   2   2   1   3   1   3   8   3   7 
374 375 376 377 378 379 380 381 382 383 384 386 387 388 389 390 391 392 394 395 
  3   7   4   6  12   8   3   3   4   4   4   9   1   6   2   1   4   2   1   8 
396 400 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 418 420 421 
  4   5   3   1   3   1   4   4   4   6  10   5   1   3   1   6   3   3   3   3 
423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 439 441 442 443 444 
  6   2   3   9   3   8   1   4   5   7   3   2   3   1   2   4   3   4   6   4 
446 448 449 451 452 455 456 457 459 460 461 462 463 465 466 467 469 470 471 472 
  3   5   2   3   4   3   3   2   4   6   5   3   3   2   3   2   1   1   5   1 
473 474 475 476 477 479 481 482 484 486 487 488 489 491 492 493 494 495 496 497 
  4   5   1   3   1   3   5   3   1   3   5   5   8   5   1   5   2   5   2   3 
498 500 505 506 508 509 510 511 512 513 514 515 516 518 519 520 521 522 523 526 
  2   6   1   2   5   1   2   4   2   5   3   2   5   2   4   3   1   2   3   3 
527 528 529 530 531 534 535 536 538 539 540 541 542 543 544 545 546 548 549 550 
  3   5   1   1   1   1   4   1   2   1   4   4   3   1   3   6   4   2   1   1 
551 553 554 555 556 557 560 561 562 563 564 566 567 569 570 571 572 573 576 579 
  1   1   3   5   2   3   2   2   6   5   2   7   2   1   3   4   2   2   1   2 
581 593 594 595 598 601 604 
  2   2   1   1   1   2   1 
1 least connected region:
1732 with 6 links
1 most connected region:
3263 with 604 links

Weights style: W 
Weights constants summary:
     n       nn   S0       S1       S2
W 3824 14622976 3824 47.67099 15430.41

Lastly, lm.morantest() of spdep package will be used to perform Moran’s I test for residual spatial autocorrelation to validate if our inital observation if there are signs of spatial autocorrelation

moran_test <-lm.morantest(resale.mlr, nb_lw)
Interpretation of Moran’s I Test Results

The Moran’s I test for residual spatial autocorrelation was conducted using the lm.morantest() function. Here’s a breakdown of the results:

Moran I Statistic:

The observed Moran’s I value is 0.299, which indicates a positive spatial autocorrelation in the residuals. This suggests that similar values tend to be spatially clustered, which is a sign of spatial dependence that hasn’t been fully captured by the regression model.

Significance (p-value): The p-value is extremely low (p-value < 2.2e-16) which strongly rejects the null hypothesis of no spatial autocorrelation. This statistically confirms that there is significant spatial autocorrelation in the residuals of the model.

write_rds(moran_test, "data/HDB/rds/moran_resale_sp.rds")
moran_test <- read_rds("data/HDB/rds/moran_resale_sp.rds")

11 Building Geographical Random Forest Model

To prepare for calibrating the random forest model, we first need to generate coordinate data required by the SpatialML package. This can be accomplished by using st_coordinates() from the sf package. After extracting the coordinates, we will remove the geometry from resale_train using st_drop_geometry(), creating a non-spatial dataset ready for analysis.

coords_train <- st_coordinates(resale_train)
coords_train <- write_rds(coords_train, "data/HDB/rds/coords_train.rds") # writing output into rds for future use
coords_train <- read_rds("data/HDB/rds/coords_train.rds")

Using st_drop_geometry()

resale_train_nogeo <- resale_train %>% 
  st_drop_geometry()

11.1 Computing adaptive bandwidth

bw.gwr() of GWmodel package will be used to determine the optimal bandwidth to be used for the Random Forest Model. An adaptive bandwidth will be applied to allow the bandwidth to adjust based on the density of data points, offering greater flexibility in regions with sparse data. Also, to determine the optimal bandwidth, we will use a cross-validation (CV) approach with the bw.gwr() function as well.

bw_adaptive <- bw.gwr(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
                     PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
                     PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
                     NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
                      data = resale_sp, 
                      approach = "CV", 
                      kernel = "gaussian", 
                      adaptive = TRUE, 
                      longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
Adaptive bandwidth: 2371 CV score: 1.368532e+13 
Adaptive bandwidth: 1473 CV score: 1.288663e+13 
Adaptive bandwidth: 918 CV score: 1.161493e+13 
Adaptive bandwidth: 574 CV score: 1.015948e+13 
Adaptive bandwidth: 363 CV score: 8.471624e+12 
Adaptive bandwidth: 230 CV score: 7.183122e+12 
Adaptive bandwidth: 151 CV score: 5.939131e+12 
Adaptive bandwidth: 98 CV score: 5.153881e+12 
Adaptive bandwidth: 70 CV score: 4.78133e+12 
Adaptive bandwidth: 47 CV score: 4.624679e+12 
Adaptive bandwidth: 39 CV score: 4.564377e+12 
Adaptive bandwidth: 27 CV score: 8.95862e+12 
Adaptive bandwidth: 39 CV score: 4.564377e+12 

The result shows that 64 neighbour points will be the optimal bandwidth to be used if adaptive bandwidth is used for this data set.

write_rds(bw_adaptive, "data/HDB/model/bw_adaptive.rds")
bw_adaptive <- read_rds("data/HDB/model/bw_adaptive.rds")
bw_adaptive
[1] 39

11.2 Calibrating Random Forest Model

With the results from the adaptive bandwidth computed earlier on, this section will illustrate the calibration of a Geographically Weighted Random Forest Model using grf() of SpatialML package.

Code chunk below is used to verify the variable names before proceeding.

names(resale_train_nogeo)
 [1] "MONTH"             "TOWN"              "ADDRESS"          
 [4] "REMAINING_LEASE"   "FLOOR_AREA_SQM"    "UNIT_AGE"         
 [7] "RESALE_PRICE"      "LEVEL"             "PROX_CBD"         
[10] "PROX_ELDERCARE"    "PROX_CHAS"         "PROX_HAWKER"      
[13] "PROX_MRT"          "PROX_PARK"         "PROX_CHILDCARE"   
[16] "PROX_KINDERGARTEN" "PROX_SUPERMARKET"  "PROX_BUS_STOP"    
[19] "NUM_KINDERGARTEN"  "NUM_CHILDCARE"     "NUM_BUS_STOP"     
[22] "NUM_CHAS"         
names(resale_train)
 [1] "MONTH"             "TOWN"              "ADDRESS"          
 [4] "REMAINING_LEASE"   "FLOOR_AREA_SQM"    "UNIT_AGE"         
 [7] "RESALE_PRICE"      "LEVEL"             "PROX_CBD"         
[10] "PROX_ELDERCARE"    "PROX_CHAS"         "PROX_HAWKER"      
[13] "PROX_MRT"          "PROX_PARK"         "PROX_CHILDCARE"   
[16] "PROX_KINDERGARTEN" "PROX_SUPERMARKET"  "PROX_BUS_STOP"    
[19] "NUM_KINDERGARTEN"  "NUM_CHILDCARE"     "NUM_BUS_STOP"     
[22] "NUM_CHAS"          "geometry"         
str(resale_train_nogeo)
tibble [3,824 × 22] (S3: tbl_df/tbl/data.frame)
 $ MONTH            : chr [1:3824] "2023-08" "2023-09" "2023-08" "2023-09" ...
 $ TOWN             : chr [1:3824] "TAMPINES" "JURONG WEST" "PUNGGOL" "KALLANG/WHAMPOA" ...
 $ ADDRESS          : chr [1:3824] "879A TAMPINES AVE 8" "681C JURONG WEST CTRL 1" "632A PUNGGOL DR" "2C UPP BOON KENG RD" ...
 $ REMAINING_LEASE  : num [1:3824] 92.9 75.9 83.4 82.2 93.6 ...
 $ FLOOR_AREA_SQM   : num [1:3824] 93 95 93 90 93 93 93 105 102 92 ...
 $ UNIT_AGE         : num [1:3824] 6.08 23.08 15.58 16.83 5.42 ...
 $ RESALE_PRICE     : num [1:3824] 702888 525000 560000 880000 600000 ...
 $ LEVEL            : num [1:3824] 11 14 14 14 8 14 11 2 8 20 ...
 $ PROX_CBD         : num [1:3824] 10722 17409 13699 3085 17099 ...
 $ PROX_ELDERCARE   : num [1:3824] 453 792 1220 885 1859 ...
 $ PROX_CHAS        : num [1:3824] 74.6 157.3 230.9 104.2 59.3 ...
 $ PROX_HAWKER      : num [1:3824] 1203 893 1260 211 1016 ...
 $ PROX_MRT         : num [1:3824] 1254 667 199 197 360 ...
 $ PROX_PARK        : num [1:3824] 1591 789 1369 755 502 ...
 $ PROX_CHILDCARE   : num [1:3824] 74.6 49.3 160.6 99.8 55.5 ...
 $ PROX_KINDERGARTEN: num [1:3824] 304 285 419 146 327 ...
 $ PROX_SUPERMARKET : num [1:3824] 74.6 497.4 241.9 103.7 341.9 ...
 $ PROX_BUS_STOP    : num [1:3824] 73.1 170 140 28.6 124.7 ...
 $ NUM_KINDERGARTEN : num [1:3824] 1 1 0 1 1 1 1 0 1 0 ...
 $ NUM_CHILDCARE    : num [1:3824] 4 5 5 4 6 11 5 2 7 5 ...
 $ NUM_BUS_STOP     : num [1:3824] 10 11 7 9 9 7 8 6 11 4 ...
 $ NUM_CHAS         : num [1:3824] 1 1 4 5 5 2 4 0 3 3 ...

The code chunk below is used to calibrate a model to predict HDB four-room resale price by using random forest function of ranger package.

set.seed(1234)

rf <- ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
             PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
             PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
             NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
             data = resale_train_nogeo)
rf
Ranger result

Call:
 ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +      PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT +      PROX_PARK + PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET +      PROX_BUS_STOP + NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOP +      NUM_CHAS, data = resale_train_nogeo) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3824 
Number of independent variables:  17 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1803594630 
R squared (OOB):                  0.8965837 
write_rds(rf, "data/HDB/model/rf.rds")
rf <- read_rds("data/HDB/model/rf.rds")
rf
Ranger result

Call:
 ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +      PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT +      PROX_PARK + PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET +      PROX_BUS_STOP + NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOP +      NUM_CHAS, data = resale_train_nogeo) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      3824 
Number of independent variables:  17 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1803594630 
R squared (OOB):                  0.8965837 

The model output is also saved in the following code chunk below.

  • Number or trees (e.g., ntree = 50)

  • 200 is selected because from earlier experiments with 500 trees, the gain in accuracy was minimal (a slight improvement in R-squared and OOB MSE). Lower number of trees can be a good compromise for faster computation and reducing complexity.Ideally we should let the model run and determine the number of trees which might be up to 500.

  • kernel = "adaptive"

  • ensures that the model adjusts spatial influence based on local data density

  • verbose = TRUE is helpful for debugging and monitoring the model’s progress.

pacman::p_load(parallel, profvis)
set.seed(1234)

gwRF_adaptive <- grf(formula = RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +
                     PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT + PROX_PARK +
                     PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET + PROX_BUS_STOP + NUM_KINDERGARTEN +
                     NUM_CHILDCARE + NUM_BUS_STOP + NUM_CHAS,
                     dframe = resale_train_nogeo, 
                     bw = bw_adaptive, # based on earlier computed adaptive bandwidth
                     ntree = 200, 
                     kernel = "adaptive",
                     verbose = TRUE,
                     coords = coords_train)
Ranger result

Call:
 ranger(RESALE_PRICE ~ FLOOR_AREA_SQM + LEVEL + REMAINING_LEASE +      PROX_CBD + PROX_ELDERCARE + PROX_CHAS + PROX_HAWKER + PROX_MRT +      PROX_PARK + PROX_CHILDCARE + PROX_KINDERGARTEN + PROX_SUPERMARKET +      PROX_BUS_STOP + NUM_KINDERGARTEN + NUM_CHILDCARE + NUM_BUS_STOP +      NUM_CHAS, data = resale_train_nogeo, num.trees = 200, mtry = 5,      importance = "impurity", num.threads = NULL, verbose = TRUE) 

Type:                             Regression 
Number of trees:                  200 
Sample size:                      3824 
Number of independent variables:  17 
Mtry:                             5 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       1737145553 
R squared (OOB):                  0.9003939 
   FLOOR_AREA_SQM             LEVEL   REMAINING_LEASE          PROX_CBD 
     3.263165e+12      1.047779e+13      1.488454e+13      2.077840e+13 
   PROX_ELDERCARE         PROX_CHAS       PROX_HAWKER          PROX_MRT 
     2.260927e+12      8.291013e+11      2.772082e+12      2.192365e+12 
        PROX_PARK    PROX_CHILDCARE PROX_KINDERGARTEN  PROX_SUPERMARKET 
     1.863344e+12      7.977834e+11      1.113445e+12      1.092576e+12 
    PROX_BUS_STOP  NUM_KINDERGARTEN     NUM_CHILDCARE      NUM_BUS_STOP 
     8.380639e+11      3.849591e+11      5.332950e+11      5.872851e+11 
         NUM_CHAS 
     1.461604e+12 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-402888.9  -21537.6    -488.3    -917.3   21066.1  649147.3 
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-114283.74   -3516.67     -56.18     -32.67    3518.50  103936.38 
                        Min          Max        Mean         StD
FLOOR_AREA_SQM            0 327369341307 18564730799 36283978234
LEVEL             179517210 237066921698 17357244435 26558883550
REMAINING_LEASE   476456004 610467337521 46169698946 77798113135
PROX_CBD          159160269 398838028133 13601543671 29690681093
PROX_ELDERCARE    229472032 538119933435 12116733479 26711034015
PROX_CHAS         169045122 164752719384  7669699018 12780222753
PROX_HAWKER       205849444 212345427516 11246287515 20680809603
PROX_MRT          197451249 325520355339 11680829832 23418978745
PROX_PARK         150692058 324539620889 12175408296 22684720379
PROX_CHILDCARE    161726549 332142137300  9466247889 21304593306
PROX_KINDERGARTEN 168923960 497657122214 11349619493 30649799599
PROX_SUPERMARKET  173000086 226481570982  9960133501 20427808278
PROX_BUS_STOP      97177838 221479676375  8236757119 20192427806
NUM_KINDERGARTEN          0 107244057163  1587110186  4979233427
NUM_CHILDCARE      15035261 203734234177  5476256383 14279107513
NUM_BUS_STOP       34946054 304540033154  6539865575 19902376737
NUM_CHAS                  0 262309235853  5966025174 16453101580
Result Breakdown

Based on the results and parameters, it can be observed that

Top 5 Predictors by Importance The top six predictors, based on their importance scores, are as follows: 1. FLOOR_AREA_SQM: Interpretation: Floor area is the most important predictor in the model, which aligns with the general understanding that larger flats tend to have higher resale values. Geographical Implication: This factor is likely uniformly significant across areas, as floor area directly impacts the valuation irrespective of location.

  1. LEVEL: Interpretation: The floor level of the unit is also highly significant, likely due to the premium placed on higher floors for better views, privacy, and less noise.

  2. REMAINING_LEASE:

Interpretation: The remaining lease length strongly influences resale value, as longer leases provide more value to buyers, especially in Singapore where HDB leases are finite.

  1. PROX_CBD:

Interpretation: Proximity to the Central Business District (CBD) is a major driver of value, as it suggests better access to job opportunities, amenities, and transport options.

Secondary

  1. PROX_HAWKER : Interpretation: Proximity to hawker centres, an important amenity in Singapore, significantly impacts HDB resale prices, as they offer affordable dining options.

  2. PROX_MRT:

Interpretation: Proximity to MRT stations is another key factor, as convenient public transportation access is highly valued in Singapore’s urban environment.

  1. PROX_PARK and PROX_ELDERCARE also contribute, albeit to a lesser degree, highlighting the value of recreational spaces and proximity to eldercare facilities.

The top predictors identified by this geographically weighted random forest model, particularly are floor area, level, and remaining lease, aligning with general expectations in HDB pricing. Location-specific amenities (e.g., proximity to CBD, MRT, and hawker centres) further emphasize the premium placed on convenience and accessibility in Singapore’s real estate market. The high R-squared value and reasonable OOB MSE suggest a robust model that captures both structural and spatial elements effectively.

  • Less influential variables:

Variables like NUM_KINDERGARTEN, PROX_CHILDCARE, and NUM_BUS_STOP have lower importance values, suggesting they contribute less to resale price variation.

Model Performance
  • R-squared (Not OOB): 98.559% This indicates that the model explains 98.56% of the variance in the resale prices. This seems to be a good fit for the data, suggesting the model is effective at capturing key predictors.

11.2.1 Saving the model output

Code chunk below is used to save the results as an rds file.

write_rds(gwRF_adaptive, "data/HDB/model/gwRF_adaptive.rds")
gwRF_adaptive <- read_rds("data/HDB/model/gwRF_adaptive.rds")

12 Model Evaluation (Predicting using test data)

To begin, we will prepare the test dataset, consisting of transactions from (Q3) July to September 2024. Like the training data, the test data requires coordinate information for use with the SpatialML package. These coordinates can be extracted using st_coordinates() from the sf package. Afterwards, the geometry data will be removed from resale_test using st_drop_geometry(), resulting in an aspatial dataset ready for analysis.

coords_test <- st_coordinates(resale_test)
coords_test <- write_rds(coords_test, "data/HDB/rds/coords_test.rds")

resale_test_nogeo <- cbind(resale_test, coords_test) %>%
  st_drop_geometry()

12.1 Predicting with the test data

The Multiple Linear Regression (MLR) will first be used to make predictions based on the test data using the function predict() with the results stored in the test dataset for model evaluation afterwards.

resale_test$MLR_PREDICT <- predict(object = resale.mlr, newdata = resale_test)

12.2 Predicting with Geographically Weighted Random Forest

Next, predict.grf() of spatialML package will be used to predict the resale value by using the resale_test data and gwRF_adaptive model calibrated earlier.

resale_test$GWRF_PREDICT <- predict.grf(gwRF_adaptive,
                                          resale_test_nogeo, 
                                          x.var.name = "X",
                                          y.var.name = "Y", 
                                          local.w = 1,
                                          global.w = 0)

Before moving on the output is saved as an rds file for future usage

GRF_pred <- write_rds(resale_test, "data/HDB/rds/resale_test_pred.rds")
GRF_pred <- read_rds("data/HDB/rds/resale_test_pred.rds")

13 Calculation of Root Mean Square Error (RMSE)

The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis. In the code chunk below, rmse() of Metrics package is used to compute the RMSE.

The output of the predict.grf() is a vector of predicted values. It is converted into a data frame for further visualisation and analysis.

GRF_pred_df <- as.data.frame(GRF_pred)

13.1 RMSE for Geographically Weighted Random Forest

rmse_gwrf <- rmse(resale_test$RESALE_PRICE, 
                  resale_test$GWRF_PREDICT)

rmse_gwrf
[1] 84847.19

13.2 RMSE for Multiple Linear Regression

rmse_mlr <- rmse(resale_test$RESALE_PRICE,
                 resale_test$MLR_PREDICT)

rmse_mlr
[1] 89310.01

13.3 RMSE Results (Table)

The RMSE results for each model can then be stored as a table format for reference using the head() function.

rmse_mlr <- rmse(resale_test$RESALE_PRICE, resale_test$MLR_PREDICT)
rmse_gwrf <- rmse(resale_test$RESALE_PRICE, resale_test$GWRF_PREDICT)

rmse <- data.frame(
  Model = c("Multiple Linear Regression", "Geographically Weighted Random Forest"),
  RMSE = c(rmse_mlr, rmse_gwrf)
)

head(rmse)
                                  Model     RMSE
1            Multiple Linear Regression 89310.01
2 Geographically Weighted Random Forest 84847.19

13.4 Visualising and Analysing the Model Prediction Results

Additionally, a scatter plot can be used to visualise the actual resale price and the predicted resale price of both models by using the code chunk below.

Note

A better predictive model should have the scatter point close to the diagonal line. The scatter plot can be also used to detect if any outliers are in the model.

plot_mlr <- ggplot(data = resale_test,
                   aes(x = MLR_PREDICT,
                       y = RESALE_PRICE)) +
  geom_point() +
  ggtitle("Actual Resale Price vs Predicted Price (MLR)")

plot_gwrf <- ggplot(data = resale_test,
                    aes(x = GWRF_PREDICT,
                        y = RESALE_PRICE)) +
  geom_point() + 
  ggtitle("Actual Resale Price vs Predicted Price (GWRF)")

ggarrange(plot_mlr, plot_gwrf, ncol = 2)

Summary and Model Performance

RMSE Comparison:

  • The RMSE for Multiple Linear Regression (MLR) is 89310.01, while for Geographically Weighted Random Forest (GWRF), it is 84847.19. The lower RMSE for GWRF indicates better predictive performance compared to MLR, suggesting that incorporating spatial information improves prediction accuracy.

Scatter Plot Analysis:

  • Both models show scatter points roughly aligned along the diagonal, indicating reasonable predictive capability. The GWRF model has a tighter clustering of points closer to the diagonal, implying better alignment between actual and predicted resale prices.

Model Interpretation:

  • The GWRF model demonstrates superior performance due to its ability to account for spatial heterogeneity and relationships, as seen in the reduced RMSE and improved scatter alignment.

  • This aligns with the observed spatial autocorrelation in the dataset, highlighting the value of using a geographically weighted model.

Recommendation:

  • For more reliable results for a model, the GWRF should be preferred for tasks requiring precise predictions of resale prices, especially in spatially non-uniform datasets with evident spatial autocorrelation.

14 Conclusion from results

The objective of this exercise was to predict HDB resale prices effectively while accounting for spatial relationships in the data. By comparing the performance of Multiple Linear Regression (MLR) and Geographically Weighted Random Forest (GWRF) models, it was evident that the GWRF model outperformed the MLR model. This was reflected in the lower Root Mean Square Error (RMSE) achieved by the GWRF model, which stood at 84847.19 compared to 89310.01 for the MLR model. The improved accuracy of the GWRF model underscores the importance of addressing spatial heterogeneity when modelling geographically distributed data.

The dataset used in this exercise exhibited signs of spatial autocorrelation, necessitating the use of spatially-aware models. GWRF effectively leveraged spatial relationships to capture local variations in factors influencing resale prices. The use of an adaptive bandwidth further optimized the model by allowing the kernel to adjust dynamically based on data density, thereby enhancing its performance. Additionally, the analysis of variable importance highlighted that factors such as REMAINING_LEASE, proximity to the central business district (CBD), and FLOOR_AREA_SQM were the most influential drivers of resale prices. These insights underscore the importance of focusing on these attributes in future analyses or pricing strategies.

Visual validation through scatter plots of predicted versus actual resale prices demonstrated that the GWRF model produced predictions that aligned more closely with observed values compared to the MLR model. The points in the GWRF scatter plot clustered nearer to the diagonal, further reinforcing its superior predictive capabilities. From a practical perspective, for scenarios involving non-uniform spatial data with evident spatial autocorrelation, the GWRF model proves to be a more robust and reliable choice compared to traditional regression techniques. Its ability to incorporate local variability and spatial relationships makes it particularly well-suited for predictive modelling of geographically distributed data.

In conclusion, this exercise demonstrates the importance of integrating spatial methodologies in real estate pricing. The findings highlight the value of geographically weighted modelling approaches in addressing spatial heterogeneity and improving prediction accuracy. Future analyses should continue to leverage spatially-aware models such as GWRF while refining bandwidth parameters and incorporating additional spatial covariates to further enhance performance. The generalizability of this approach should also be evaluated on other property types to assess its broader applicability.

15 References

Kam, T.S. (2024). Calibrating Hedonic Pricing Model for Private Highrise Property with GWR Method

Kam, T.S. (2024). Geographically Weighted Predictive Models

IS415 Megan Sim Take-Home Exercise 3(2021). Retrieved from https://is415-msty.netlify.app/posts/2021-10-25-take-home-exercise-3/